This  CVPR2013  paper  is  the  Open  Access  version,  provided  by  the  Computer  Vision  Foundation. 
The  authoritative  version  of  this  paper  is  available  in  IEEE  Xplore. 


Complex  Event  Detection  via  Multi-Source  Video  Attributes 

Zhigang Ma†  Yi Yang‡  Zhongwen Xu‡§  Shuicheng Yan¶  Nicu Sebe†  Alexander G. Hauptmann‡
†Department of Information Engineering and Computer Science, University of Trento, Italy
‡School of Computer Science, Carnegie Mellon University, USA
§College of Computer Science, Zhejiang University, China
¶Department of Electrical and Computer Engineering, National University of Singapore, Singapore

{ma, sebe}@disi.unitn.it  {yiyang, zhongwen, alex}@cs.cmu.edu  eleyans@nus.edu.sg


Abstract 

Complex events essentially include humans, scenes, objects and actions
that can be summarized by visual attributes, so leveraging relevant
attributes properly could be helpful for event detection. Many works have
exploited attributes at the image level for various applications. However,
attributes at the image level are possibly insufficient for complex event
detection in videos due to their limited capability in characterizing the
dynamic properties of video data. Hence, we propose to leverage attributes
at the video level (named video attributes in this work), i.e., the
semantic labels of external videos are used as attributes. Compared to
complex event videos, these external videos contain simple contents such
as objects, scenes and actions, which are the basic elements of complex
events. Specifically, building upon a correlation vector which correlates
the attributes and the complex event, we incorporate video attributes
latently as extra informative cues into the event detector learnt from
complex event videos. Extensive experiments on a real-world large-scale
dataset validate the efficacy of the proposed approach.


1.  Introduction 

In this paper, we focus on event detection in large-scale real-world
videos [2, 3]. An “event” refers to an observable occurrence that
interests users, is found in specific scenes, and is characterized by the
subjects and objects involved [15]. In the past, detection of events that
are simple, well-defined and describable by a short video sequence, e.g.,
hand shaking, has been widely studied. In the real world, however, users
are more interested in videos depicting complex events such as celebrating
the New Year. Complex event detection is very challenging as these events
usually contain many people and/or objects, various human actions and
multiple scenes; have significant intra-class variations; and take place
in much longer video clips [2, 3, 15, 16]. Despite the difficulty, the
practical significance of complex event detection has drawn increasing
interest from researchers [23, 10, 15, 16]. For example, Ma et al. have
introduced the first exploration of Ad Hoc multimedia event detection when
there are only 10 positive examples for training [15]. However, this
research area remains in its infancy, which motivates us to pursue better
performance. As complex events usually contain visual attributes related
to people, scenes, objects and human actions (e.g., Figure 1 shows that
the complex event parkour is related to push ups, building, etc.),
leveraging these attributes properly could be helpful for detection.

Figure 1. The illustration of our approach for complex event detection
with video attributes.

Visual attributes were introduced as describable properties of an object
and have been applied to many applications [5, 7, 9]. Visual attributes
can be either at the local level or the global level. For example, “many
people” is a local-level attribute for the event flash mob gathering. Yet
this kind of attribute is mostly defined manually, which is time-consuming
and requires expertise. Instead, we can use attributes at the global
level, which are the semantic labels of images [19]. For instance, an
image with the semantic label tennis can be leveraged for understanding
the event playing tennis. Given that this type of attribute is associated
with images, we regard it as an image attribute. Using image attributes
for complex event detection is intuitively limited, as image attributes
usually cannot characterize the dynamic properties of complex event videos
(complex event videos refer to the videos depicting complex events). In
this paper, we therefore propose the idea of video attributes and apply it
to complex event detection. Video attributes, in our work, are the
semantic labels of other external videos collected by researchers. Note
that these external videos are different from complex event videos.
Compared to complex event videos, the external videos contain simple
contents of people, objects, scenes and actions, which are basic elements
of complex events. For example, a video with the semantic label mixing
batter is useful for understanding the complex event making a cake. As the
external videos are used by treating their semantic labels as video
attributes, we call these videos attribute videos.

To use video attributes, we may refer to a typical approach that involves
training attribute classifiers and then using their outputs as
intermediate representations for the complex event videos [8, 11]. But
this approach has two problems. First, when the number of attributes used
is limited, it is insufficient to learn a discriminative intermediate
representation. Second, given a particular event to detect, only some
attributes are discriminative while others are comparatively useless or
even noisy [16], and it is difficult to decide which attributes to use for
different events. In contrast, we propose to use video attributes as
additional information to assist complex event detection. Specifically,
our framework learns the attribute classifier and event detector
simultaneously. The observation of a particular event affects the
attribute classifier and, in return, attributes characterize the event.
This kind of mutual influence is explored by a correlation vector, which
helps incorporate extra informative cues into the event detector. We name
the proposed method Multi-level Collaborative Regression (MCR). Our
approach has two merits: the learning process of the event detector is not
solely dependent on the video attributes; and the joint framework adapts
the knowledge from attributes for different events, i.e., a particular
event obtains dedicated benefits via the joint learning of the attribute
classifier and the event detector.

Moreover, we propose to integrate multiple features from both complex
event videos and attribute videos for learning the detector, as combining
multiple features has proved to be beneficial for visual analysis [22]. In
addition, existing video collections have different themes. As we expect
the video attributes to be diverse, our framework utilizes video
attributes from different collections. Our approach is illustrated in
Figure 1.

The main contributions of this paper are as follows: First, we propose
using video attributes for complex event detection. Second, video
attributes are used latently as additional information for learning the
event detector. Third, multiple attribute video sets with different
features are seamlessly integrated with multiple features from the complex
event videos.

2.  Related  Work 

Visual attributes were advocated as the describable properties of
objects [8]. For example, the object bear can be described by attributes
such as furry and four legs. Attributes are both machine-detectable and
human-understandable, so they have been widely used for various
applications. Wang et al. have proposed a discriminative model for object
recognition [21]. The attributes of an object are treated as latent
variables and the correlations among attributes are used to classify
object classes. A method to learn visual attributes and object classes
together has been presented in [20]. Duan et al. have presented an
interactive approach which discovers local attributes that are both
discriminative and semantically meaningful for fine-grained category
classification [7]. Hwang et al. have proposed to explore the shared
features between objects and their attributes for animal and scene
classification [9]. Dhar et al. have leveraged high-level describable
attributes for selecting images of high aesthetic quality and interesting
images from large image collections [5]. However, generating local
attributes usually requires a manual definition process, which is
burdensome. An alternative is to leverage attributes at the global level,
i.e., the semantic labels of visual data. The convenience of this
alternative is that many labeled datasets covering a wide range of themes
are available. By treating the semantic labels of these data as
attributes, we can readily leverage them.

In the past, global-level image attributes have been widely used [19, 14].
For example, Luo et al. have presented an object classification method by
casting prior features obtained from global image attributes of auxiliary
images into their multiple kernel learning framework [14]. For recognition
or detection tasks in videos, however, image attributes probably cannot
characterize the dynamic properties well, which could limit their
contribution.

In [16], Ma et al. have proposed learning an intermediate representation
for event detection. In their approach, the intermediate representation is
the same for the event videos and the attribute videos. However, it is
natural to assume that events and attributes are different depictions of
videos at different levels. In addition, their intermediate representation
is not interpretable. In contrast, we leverage video attributes to
characterize complex events, which is interpretable. In our framework, we
also learn an attribute classifier which can be used to predict the
attributes of a given video. Since the attribute classifier is jointly
optimized with the event detector, the related attribute classifier is
more accurate in uncovering the attributes of an event video. For example,
by exploiting the videos of “landing a fish”, the concept classifier
“fish” can be trained more accurately, and vice versa. In addition, as a
byproduct of our method, the attribute representation can be further used
for other applications such as multimedia event recounting [6].


3.  Video  Attributes  Assisted  Event  Detection 


We first correlate the features of attribute videos from $m$ multiple
sources with their semantic labels respectively. The features of different
sources can be different. Following [15, 16], we perform full rank
principal component analysis [18] to map the features into a Hilbert space
$\mathcal{H}$. Denote their representations in $\mathcal{H}$ as
$V_i|_{i=1}^{m} \in \mathbb{R}^{d_i \times n_i}$, where $d_i$ is the
dimension and $n_i$ indicates the number of videos. Suppose the semantic
labels are $A_i|_{i=1}^{m} \in \mathbb{R}^{n_i \times c_i}$, where $c_i$
is the number of classes. We propose the following regression loss:

$$\min_{Q_i} \sum_{i=1}^{m} \left\| V_i^T Q_i - A_i \right\|_F^2, \tag{1}$$

where $Q_i \in \mathbb{R}^{d_i \times c_i}$ associates $V_i$ with $A_i$.
Next we illustrate how to learn a detector for the complex event by
incorporating the attribute videos. Similarly, we first map the multiple
features of the complex event videos into $\mathcal{H}$ and denote the
resulting representations as
$X_i|_{i=1}^{m} \in \mathbb{R}^{d_i \times n}$, where $n$ is the number of
complex event videos. We first propose to learn multiple detectors
$w_i \in \mathbb{R}^{d_i \times 1}$ to associate $X_i$ with the ground
truth labels $y \in \mathbb{R}^{n \times 1}$:

$$\min_{w_i} \sum_{i=1}^{m} \left\| X_i^T w_i - y \right\|_2^2. \tag{2}$$

On top of the above function, we aim to correlate different feature types
in a joint framework. It is expected that the learning processes from
different feature types are seamlessly combined to obtain better $w_i$.
Hence, we bring in the predicted labels
$f_i \in \mathbb{R}^{n \times 1}$ for each feature type and minimize the
following objective:

$$\min_{w_i, f_i} \sum_{i=1}^{m} \left\| X_i^T w_i - f_i \right\|_2^2 + \left\| f_i - y \right\|_2^2. \tag{3}$$

Now we show how to incorporate the attribute videos for optimizing $w_i$.
Since the attribute videos and the complex event videos are relevant,
i.e., complex events are usually related to people, scenes, objects and
human actions, the two domains would have some shared knowledge. Inspired
by previous works [4], we assume that a correlation vector
$p_i \in \mathbb{R}^{c_i \times 1}$ exists to establish the correspondence
between $Q_i$ and $w_i$. Thus, Eq (3) is extended as:

$$\min_{w_i, Q_i, p_i, f_i} \sum_{i=1}^{m} \left\| X_i^T (w_i + \beta Q_i p_i) - f_i \right\|_2^2 + \left\| f_i - y \right\|_2^2, \tag{4}$$

where $\beta$ is a parameter to control the influence of the attribute
videos on the event detection. To this end, our objective function is
formulated as follows:

$$\min_{w_i, Q_i, p_i, f_i} \sum_{i=1}^{m} \left\| V_i^T Q_i - A_i \right\|_F^2 + \alpha \left( \left\| X_i^T (w_i + \beta Q_i p_i) - f_i \right\|_2^2 + \left\| f_i - y \right\|_2^2 \right) + \gamma \left\| w_i \right\|_2^2, \tag{5}$$

where the last term is added to avoid over-fitting. For a testing video,
$(w_i + \beta Q_i p_i)$ is used for prediction.


4. Optimization Procedure

We propose an alternating approach to optimize the objective function in
Eq (5).

First, we fix $p_i$ and optimize $f_i$, $w_i$ and $Q_i$. By setting the
derivative of Eq (5) w.r.t. $f_i$ to zero, we have:

$$f_i = \left( X_i^T (w_i + \beta Q_i p_i) + y \right) / 2. \tag{6}$$

Substituting Eq (6) into Eq (5), and absorbing the resulting constant
factor 1/2 of the detection term into $\alpha$, we obtain:

$$\min_{w_i, Q_i} \sum_{i=1}^{m} \left\| V_i^T Q_i - A_i \right\|_F^2 + \alpha \left\| X_i^T (w_i + \beta Q_i p_i) - y \right\|_2^2 + \gamma \left\| w_i \right\|_2^2. \tag{7}$$

By setting the derivative of Eq (7) w.r.t. $w_i$ to zero, it becomes:

$$w_i = \alpha B_i^{-1} X_i y - \alpha \beta B_i^{-1} X_i X_i^T Q_i p_i, \tag{8}$$

where $B_i = \alpha X_i X_i^T + \gamma I$. Substituting Eq (8) into Eq (7)
we have:

$$\min_{Q_i} \sum_{i=1}^{m} \left\| V_i^T Q_i - A_i \right\|_F^2 + \alpha \, \mathrm{Tr} \Big( 2 \alpha \beta \, y^T X_i^T B_i^{-1} X_i X_i^T Q_i p_i - \alpha \beta^2 p_i^T Q_i^T X_i X_i^T B_i^{-1} X_i X_i^T Q_i p_i + \beta^2 p_i^T Q_i^T X_i X_i^T Q_i p_i - 2 \beta \, p_i^T Q_i^T X_i y \Big). \tag{9}$$

By setting the derivative of Eq (9) w.r.t. $Q_i$ to zero, we arrive at:

$$2 V_i V_i^T Q_i - 2 V_i A_i + 2 \alpha^2 \beta X_i X_i^T B_i^{-1} X_i y p_i^T - 2 \alpha^2 \beta^2 X_i X_i^T B_i^{-1} X_i X_i^T Q_i p_i p_i^T + 2 \alpha \beta^2 X_i X_i^T Q_i p_i p_i^T - 2 \alpha \beta X_i y p_i^T = 0, \tag{10}$$

which can be rewritten as:

$$\begin{aligned}
& Q_i (p_i p_i^T)^{-1} + \left( \alpha \beta^2 (V_i V_i^T)^{-1} X_i X_i^T - \alpha^2 \beta^2 (V_i V_i^T)^{-1} X_i X_i^T B_i^{-1} X_i X_i^T \right) Q_i - (V_i V_i^T)^{-1} V_i A_i (p_i p_i^T)^{-1} \\
& \quad + \alpha^2 \beta (V_i V_i^T)^{-1} X_i X_i^T B_i^{-1} X_i y p_i^T (p_i p_i^T)^{-1} - \alpha \beta (V_i V_i^T)^{-1} X_i y p_i^T (p_i p_i^T)^{-1} = 0.
\end{aligned} \tag{11}$$

The above problem can be solved by the Sylvester equation [1].

After $Q_i$, $w_i$ and $f_i$ are obtained, we fix them and optimize $p_i$.
By setting the derivative of Eq (5) w.r.t. $p_i$ to zero, we have:

$$p_i = \left( \beta Q_i^T X_i X_i^T Q_i \right)^{-1} \left( Q_i^T X_i f_i - Q_i^T X_i X_i^T w_i \right). \tag{12}$$

Thereby, we propose the algorithm shown in Algorithm 1 to optimize the
objective function in Eq (5).
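
The closed-form updates in Eqs (6), (8) and (12) can be checked numerically. The following is a minimal NumPy sketch for a single feature type (the index $i$ is dropped); the toy sizes, the random data and the function names `f_update`, `w_update` and `p_update` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def f_update(X, w, Q, p, y, beta):
    """Eq (6): predicted labels as the mean of the detector output and y."""
    return (X.T @ (w + beta * Q @ p) + y) / 2.0

def w_update(X, Q, p, y, alpha, beta, gamma):
    """Eq (8), with B = alpha * X X^T + gamma * I."""
    K = X @ X.T
    B = alpha * K + gamma * np.eye(X.shape[0])
    return alpha * np.linalg.solve(B, X @ y - beta * K @ Q @ p)

def p_update(X, w, Q, f, beta):
    """Eq (12): correlation vector for fixed w, Q and f."""
    G = Q.T @ X  # shape (c, n)
    return np.linalg.solve(beta * G @ G.T, G @ f - G @ X.T @ w)

# Toy problem; sizes, seed and parameter values are arbitrary assumptions.
rng = np.random.default_rng(0)
d, n, c = 6, 20, 3
alpha, beta, gamma = 0.5, 0.7, 0.1
X = rng.standard_normal((d, n))                     # one column per event video
y = rng.integers(0, 2, size=(n, 1)).astype(float)   # binary event labels
Q = rng.standard_normal((d, c))
p = rng.standard_normal((c, 1))

w = w_update(X, Q, p, y, alpha, beta, gamma)
# Gradient of Eq (7) w.r.t. w, evaluated at the Eq (8) solution.
grad_w = 2 * alpha * X @ (X.T @ (w + beta * Q @ p) - y) + 2 * gamma * w

f = f_update(X, w, Q, p, y, beta)
p_new = p_update(X, w, Q, f, beta)
# Gradient of Eq (5) w.r.t. p (up to the factor 2 * alpha * beta),
# evaluated at the Eq (12) solution.
grad_p = Q.T @ X @ (X.T @ (w + beta * Q @ p_new) - f)
```

Both `grad_w` and `grad_p` should vanish up to floating-point error, which is what stationarity of the two updates requires.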


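Algorithm 1: Optimization procedure for MCR.

Input:
    $V_i \in \mathbb{R}^{d_i \times n_i}$, $A_i \in \mathbb{R}^{n_i \times c_i}$,
    $X_i \in \mathbb{R}^{d_i \times n}$, $y \in \mathbb{R}^{n \times 1}$;
    Parameters $\alpha$, $\beta$ and $\gamma$.
Output:
    Optimized $w_i \in \mathbb{R}^{d_i \times 1}$, $Q_i \in \mathbb{R}^{d_i \times c_i}$,
    $p_i \in \mathbb{R}^{c_i \times 1}$ and $f_i \in \mathbb{R}^{n \times 1}$.

1: Set $t = 0$ and initialize $p_i \in \mathbb{R}^{c_i \times 1}$ randomly;
2: repeat
       Compute $f_i$ according to Eq (6);
       Compute $w_i$ according to Eq (8);
       Solve the Sylvester equation in Eq (11) to get $Q_i$;
       Update $p_i$ according to Eq (12);
       $t = t + 1$.
   until convergence: $|obj_{t+1} - obj_t| / obj_t < 10^{-3}$
   ($obj$ indicates the objective function value);
3: Return $w_i$, $Q_i$, $p_i$ and $f_i$.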
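
As a rough illustration of the alternating scheme, here is a NumPy sketch for a single feature type and a single attribute source. It deviates from the listing above in three labeled ways, all of which are assumptions of this sketch rather than the authors' implementation: (a) the updates are ordered $Q$, $w$, $f$, $p$, since only $p$ is initialized; (b) the $w$ and $Q$ steps use $\alpha/2$, the exact constant obtained when Eq (6) is substituted into Eq (5), so that the Eq (5) value cannot increase; (c) because $p p^T$ has rank one, the $Q$ step solves the stationarity condition of Eq (10) as a Kronecker-structured linear system instead of forming Eq (11). The function names are hypothetical.

```python
import numpy as np

def objective(V, A, X, y, w, Q, p, f, alpha, beta, gamma):
    """Value of Eq (5) for one feature type."""
    u = w + beta * Q @ p
    return (np.linalg.norm(V.T @ Q - A) ** 2
            + alpha * (np.linalg.norm(X.T @ u - f) ** 2
                       + np.linalg.norm(f - y) ** 2)
            + gamma * np.linalg.norm(w) ** 2)

def mcr_single_source(V, A, X, y, alpha, beta, gamma,
                      tol=1e-3, max_iter=100, seed=0):
    """Alternating minimisation sketch; returns w, Q, p, f and the objective trace."""
    d, c = V.shape[0], A.shape[1]
    p = np.random.default_rng(seed).standard_normal((c, 1))
    K = X @ X.T
    a = alpha / 2.0                        # assumption (b): keep the factor 1/2
    B_inv = np.linalg.inv(a * K + gamma * np.eye(d))
    VVt = V @ V.T
    history = []
    for _ in range(max_iter):
        # Q-step: VV^T Q + M Q p p^T = C, solved through vec(Q) (column-major).
        M = a * beta ** 2 * (K - a * K @ B_inv @ K)
        C = V @ A + a * beta * (X @ y - a * K @ B_inv @ X @ y) @ p.T
        L = np.kron(np.eye(c), VVt) + np.kron(p @ p.T, M)
        Q = np.linalg.solve(L, C.reshape(-1, order="F")).reshape(d, c, order="F")
        # w-step: Eq (8) with alpha replaced by a.
        w = a * B_inv @ (X @ y - beta * K @ Q @ p)
        # f-step: Eq (6).
        f = (X.T @ (w + beta * Q @ p) + y) / 2.0
        # p-step: Eq (12), solved in the least-squares sense for robustness.
        G = Q.T @ X
        p = np.linalg.lstsq(beta * G @ G.T, G @ (f - X.T @ w), rcond=None)[0]
        history.append(objective(V, A, X, y, w, Q, p, f, alpha, beta, gamma))
        if len(history) > 1 and abs(history[-2] - history[-1]) / history[-2] < tol:
            break
    return w, Q, p, f, history
```

Under these assumptions, a test video with feature vector `x` would be scored as `x.T @ (w + beta * Q @ p)`, following the prediction rule stated after Eq (5). The Kronecker system has size $d c \times d c$, so this form is only practical for small toy problems.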


5.  Experiments 

In  this  section  we  present  the  experiments  that  evaluate 
the  proposed  method  for  complex  event  detection. 

5.1.  Datasets 

The TRECVID MED 2012 development set (MED12) is used for complex event
detection. MED12 consists of 50328 video clips which are related to 20
events: Birthday party, Changing a vehicle tire, Flash mob gathering,
Getting a vehicle unstuck, Grooming an animal, Making a sandwich, Parade,
Parkour, Repairing an appliance, Working on a sewing project, Attempting a
bike trick, Cleaning an appliance, Dog show, Giving directions to a
location, Marriage proposal, Renovating a home, Rock climbing, Town hall
meeting, Winning a race without a vehicle and Working on a metal crafts
project.


Two other video sets, i.e., the UCF50 dataset [17] and the development set
from the TRECVID 2012 semantic indexing task, are used as attribute
videos. UCF50 includes 6681 video sequences with 50 action categories. The
video set for the TRECVID 2012 semantic indexing (SIN) task covers 346
concepts. We use 65 concepts suggested by [6]. These concepts are related
to humans, scenes and objects, which are the elements of events. The
sampled subset contains 3244 samples and we denote it as SIN12.

We extract STIP [12] and SIFT [13] descriptors for the videos of MED12,
STIP for UCF50 and SIFT for SIN12. After that, a 32768-dimensional spatial
BoW feature is formed for STIP/SIFT to represent each video.
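
To make the feature construction concrete, the sketch below builds a spatial bag-of-words histogram from quantized descriptors. The paper does not state the codebook size or the spatial layout; a 4096-word codebook pooled over 1×1, 2×2 and 3×1 grids (8 cells in total, giving 4096 × 8 = 32768 dimensions) is one common configuration and is purely an assumption here, as are the function name `spatial_bow` and the L1 normalization.

```python
import numpy as np

def spatial_bow(word_ids, xy, n_words=4096, grids=((1, 1), (2, 2), (3, 1))):
    """Concatenate per-cell codeword histograms over several spatial grids.

    word_ids: codeword index of each local descriptor (e.g. STIP or SIFT).
    xy:       descriptor positions, normalised to [0, 1) as (x, y) pairs.
    grids:    (rows, cols) layouts; the default is an assumed configuration.
    """
    word_ids = np.asarray(word_ids, dtype=int)
    xy = np.asarray(xy, dtype=float).reshape(-1, 2)
    parts = []
    for rows, cols in grids:
        # Map each descriptor to the grid cell that contains it.
        r = np.minimum((xy[:, 1] * rows).astype(int), rows - 1)
        c = np.minimum((xy[:, 0] * cols).astype(int), cols - 1)
        cell = r * cols + c
        hist = np.zeros((rows * cols, n_words))
        np.add.at(hist, (cell, word_ids), 1.0)   # accumulate counts per cell
        parts.append(hist.ravel())
    feat = np.concatenate(parts)
    return feat / max(feat.sum(), 1.0)           # L1 normalisation (assumed)
```

With the default arguments the returned vector has 32768 entries, matching the dimension reported above.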

5.2.  Comparison  Algorithms 

(1) MCR: The proposed method in this paper. As the χ² kernel has proved to
be advantageous for BoW features, we exploit it to map the features of
MED12, UCF50 and SIN12 into the Hilbert space.

(2) Baseline: We set β in Eq (5) to 0 so that no video attributes are
exploited in our approach. The resulting algorithm works as the baseline.

(3) SVM: SVM is an effective tool for complex event detection and has been
widely used by several research groups for TRECVID MED, e.g., [23].
Similarly, the χ² kernel is used.

(4) Attributes Intermediate Representation (AIR): We train attribute
classifiers using UCF50 and SIN12. Then we apply the classifiers on MED12
and use their outputs as the intermediate representations. SVM is applied
on the new representations afterwards for event detection.

5.3.  Setup 

For each event, we randomly choose 100 positive examples and 1000 negative
examples from MED12 to form the training set. The remaining data of MED12
are used as the testing set.

There are two types of parameters. The first type includes the parameters
for kernel calculation, which are fixed to the mean of the pairwise
distances among the training samples as done in [14]. The second type
includes the regularization parameters. We tune them over
{0.001, 0.1, 10, 1000} for all the algorithms and report the best results
for each algorithm.
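
The kernel setting can be sketched as follows. The exponential χ² form $\exp(-D/\sigma)$ and the helper names are assumptions made for illustration; the text only states that a χ² kernel is used and that its parameter is fixed to the mean pairwise distance among training samples.

```python
import numpy as np

def chi2_distances(X, Y, eps=1e-12):
    """Pairwise chi-square distances between rows of X (n, d) and Y (m, d).

    Inputs are assumed to be non-negative histograms.
    """
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :]
    return 0.5 * np.sum(diff ** 2 / (summ + eps), axis=2)

def chi2_kernel(train, other=None, sigma=None):
    """Exponential chi-square kernel exp(-D / sigma).

    If sigma is not given, it is set to the mean chi-square distance over
    all distinct pairs of training samples.
    """
    if sigma is None:
        D_train = chi2_distances(train, train)
        iu = np.triu_indices(train.shape[0], k=1)
        sigma = D_train[iu].mean()
    rows = train if other is None else other
    return np.exp(-chi2_distances(rows, train) / sigma), sigma
```

A test-time kernel would reuse the training `sigma`, e.g. `K_test, _ = chi2_kernel(train, other=test, sigma=sigma)`. The broadcasting in `chi2_distances` materializes an n × m × d array, so for 32768-dimensional features a blocked computation would be needed in practice.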

We use three evaluation metrics. Minimum NDC (MinNDC) and the Probability
of Miss-Detection based on the Detection Threshold 12.5 (Pmd@12.5) are two
official evaluation metrics used by NIST in TRECVID MED [2, 3]. Lower
MinNDC or Pmd@12.5 indicates better detection performance. The third one
is Average Precision (AP). Higher AP indicates better performance.
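
For reference, Average Precision over a ranked list can be computed as in the sketch below. This is the standard non-interpolated definition; whether the reported numbers use exactly this variant (for example in how ties are handled) is not stated in the text, so treat it as an assumption. MinNDC and Pmd@12.5 depend on NIST's cost parameters and are not reproduced here.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k over the ranks k of positives.

    scores: detection scores, higher means more likely positive.
    labels: 1 (or True) for positive videos, 0 (or False) otherwise.
    """
    # Rank videos by decreasing score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0
```

For example, with scores `[0.9, 0.8, 0.7, 0.6]` and labels `[1, 0, 1, 0]` the positives sit at ranks 1 and 3, giving (1/1 + 2/3) / 2 ≈ 0.833.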
Table 1. Detection results using different algorithms. LOWER MinNDC / LOWER Pmd@12.5 / HIGHER AP indicates BETTER performance. The best results are highlighted in bold. Relative Improvement indicates our advantage over the runner-up, if applicable.

| Event Description | Evaluation Metric | Baseline | SVM | AIR | MCR | Relative Improvement |
|---|---|---|---|---|---|---|
| Birthday party | MinNDC | 0.900 | 0.877 | 1.000 | **0.858** | 2.2% |
| | Pmd@12.5 | 0.516 | 0.498 | 0.989 | **0.484** | 2.9% |
| | AP | 0.064 | 0.068 | 0.007 | **0.076** | 11.7% |
| Changing a vehicle tire | MinNDC | 0.895 | 0.753 | 1.000 | **0.719** | 4.7% |
| | Pmd@12.5 | 0.529 | 0.443 | 0.979 | **0.436** | 1.6% |
| | AP | 0.032 | 0.058 | 0.003 | **0.069** | 19.0% |
| Flash mob gathering | MinNDC | 0.463 | 0.467 | 0.721 | **0.420** | 10.2% |
| | Pmd@12.5 | 0.239 | 0.249 | 0.394 | **0.230** | 3.9% |
| | AP | 0.225 | 0.225 | 0.087 | **0.248** | 10.2% |
| Getting a vehicle unstuck | MinNDC | 0.710 | 0.607 | 0.957 | **0.559** | 8.6% |
| | Pmd@12.5 | 0.391 | **0.326** | 0.710 | 0.355 | N/A |
| | AP | 0.071 | 0.095 | 0.014 | **0.118** | 24.2% |
| Grooming an animal | MinNDC | 0.935 | 0.908 | 1.000 | **0.855** | 6.2% |
| | Pmd@12.5 | 0.532 | **0.511** | 0.957 | **0.511** | N/A |
| | AP | 0.026 | 0.029 | 0.003 | **0.034** | 17.2% |
| Making a sandwich | MinNDC | 0.950 | 0.905 | 0.985 | **0.888** | 1.9% |
| | Pmd@12.5 | 0.546 | 0.540 | 0.741 | **0.517** | 4.4% |
| | AP | 0.032 | 0.037 | 0.012 | **0.039** | 5.4% |
| Parade | MinNDC | 0.761 | 0.747 | 0.991 | **0.683** | 9.4% |
| | Pmd@12.5 | 0.391 | 0.407 | 0.579 | **0.374** | 4.5% |
| | AP | 0.123 | 0.124 | 0.044 | **0.141** | 13.7% |
| Parkour | MinNDC | 0.610 | 0.576 | 0.878 | **0.534** | 7.9% |
| | Pmd@12.5 | 0.384 | **0.344** | 0.528 | **0.344** | N/A |
| | AP | 0.092 | 0.108 | 0.030 | **0.117** | 8.3% |
| Repairing an appliance | MinNDC | 0.728 | 0.689 | 0.935 | **0.630** | 9.4% |
| | Pmd@12.5 | 0.402 | 0.386 | 0.614 | **0.378** | 2.1% |
| | AP | 0.064 | 0.066 | 0.019 | **0.084** | 27.3% |
| Working on a sewing project | MinNDC | 0.817 | 0.753 | 0.964 | **0.721** | 4.4% |
| | Pmd@12.5 | 0.475 | 0.475 | 0.639 | **0.459** | 3.5% |
| | AP | 0.042 | 0.042 | 0.015 | **0.048** | 14.3% |
| Attempting a bike trick | MinNDC | 0.692 | **0.556** | 1.000 | 0.559 | N/A |
| | Pmd@12.5 | 0.433 | **0.333** | 0.800 | **0.333** | N/A |
| | AP | 0.015 | 0.022 | 0.001 | **0.025** | 13.6% |
| Cleaning an appliance | MinNDC | 0.978 | 0.957 | 1.000 | **0.852** | 12.3% |
| | Pmd@12.5 | 0.600 | 0.700 | 0.900 | **0.467** | 28.5% |
| | AP | 0.005 | 0.004 | 0.001 | **0.007** | 40.0% |
| Dog show | MinNDC | 0.545 | 0.434 | 0.943 | **0.390** | 11.3% |
| | Pmd@12.5 | 0.300 | 0.267 | 0.600 | **0.200** | 33.5% |
| | AP | 0.028 | 0.037 | 0.004 | **0.043** | 16.2% |
| Giving directions to a location | MinNDC | 0.862 | 0.875 | 1.000 | **0.844** | 3.7% |
| | Pmd@12.5 | 0.670 | **0.667** | 0.967 | **0.667** | N/A |
| | AP | 0.005 | 0.005 | 0.001 | **0.006** | 20% |
| Marriage proposal | MinNDC | 0.824 | **0.774** | 1.000 | 0.777 | N/A |
| | Pmd@12.5 | 0.533 | **0.500** | 0.967 | **0.500** | N/A |
| | AP | 0.008 | **0.011** | 0.001 | **0.011** | N/A |
| Renovating a home | MinNDC | 0.821 | 0.821 | 1.000 | **0.735** | 11.7% |
| | Pmd@12.5 | 0.567 | 0.533 | 0.867 | **0.467** | 14.1% |
| | AP | 0.008 | 0.009 | 0.001 | **0.013** | 44.4% |
| Rock climbing | MinNDC | 0.659 | 0.670 | 0.949 | **0.575** | 14.6% |
| | Pmd@12.5 | 0.431 | 0.433 | 0.633 | **0.400** | 7.8% |
| | AP | 0.017 | 0.016 | 0.004 | **0.023** | 35.3% |
| Town hall meeting | MinNDC | 0.706 | 0.607 | 1.000 | **0.532** | 14.1% |
| | Pmd@12.5 | 0.467 | 0.367 | 1.000 | **0.300** | 22.3% |
| | AP | 0.016 | **0.023** | 0.007 | 0.020 | N/A |
| Winning a race without a vehicle | MinNDC | 0.683 | 0.585 | 0.887 | **0.565** | 3.5% |
| | Pmd@12.5 | 0.433 | **0.333** | 0.667 | 0.367 | N/A |
| | AP | 0.018 | 0.021 | 0.004 | **0.023** | 9.5% |
| Working on a metal crafts project | MinNDC | 0.822 | 0.750 | 0.947 | **0.690** | 8.7% |
| | Pmd@12.5 | 0.500 | **0.400** | 0.633 | **0.400** | N/A |
| | AP | 0.009 | 0.012 | 0.004 | **0.018** | 50.0% |
| Average | MinNDC | 0.768 | 0.716 | 0.958 | **0.669** | 7.0% |
| | Pmd@12.5 | 0.467 | 0.436 | 0.758 | **0.409** | 6.6% |
| | AP | 0.045 | 0.053 | 0.013 | **0.061** | 15.1% |

5.4. Event Detection Results

Table 1 lists the detection results. It can be seen that our method MCR is
consistently competitive for all the events. Specifically, we observe
that: 1) when using MinNDC and Pmd@12.5 as metrics, MCR gains the best
performance for 18 events; 2) when using AP as the metric, MCR is the best
method for 19 events; 3) MCR obtains the top performance for the average
accuracy over all the 20 events; 4) MCR is much better than the Baseline,
indicating that harnessing video attributes does boost the performance of
complex event detection; 5) SVM is the second most competitive algorithm,
which is in accordance with the previous experience of several research
groups in TRECVID MED; 6) for those events on which MCR achieves the top
performance, it outperforms SVM by a clear margin. For instance, MCR is
10%-75% better than SVM for 16 events in terms of AP. The promising
performance of MCR verifies that leveraging video attributes properly is
beneficial for complex event detection.
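
The Relative Improvement column of Table 1 is not defined by a formula in the text. Most entries appear to be reproduced, up to rounding of the three-decimal scores, by dividing the gap between MCR and the runner-up by the smaller of the two values; this rule is an inference from the table, not the authors' stated definition, and at least one entry (MinNDC for Giving directions to a location) does not follow it. A small sketch under that assumption, with a hypothetical function name:

```python
def relative_improvement(mcr, others, lower_is_better):
    """Percentage gain of MCR over the best competing score.

    Returns None (shown as 'N/A' in Table 1) when MCR is not strictly better.
    The denominator, the smaller of the two scores, is an inferred convention.
    """
    runner_up = min(others) if lower_is_better else max(others)
    is_better = mcr < runner_up if lower_is_better else mcr > runner_up
    if not is_better:
        return None
    return 100.0 * abs(mcr - runner_up) / min(mcr, runner_up)
```

For the Birthday party row, `relative_improvement(0.858, [0.900, 0.877, 1.000], True)` gives about 2.2, and for the average AP row `relative_improvement(0.061, [0.045, 0.053, 0.013], False)` gives about 15.1, in line with the table.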

5.5.  Results  using  Single  Feature  and  Single  Source 

In this part, we only use UCF50+MED12 with the STIP feature and
SIN12+MED12 with the SIFT feature for complex event detection to show the
performance change. As SVM is the second most competitive algorithm, we
also show its performance variation w.r.t. the STIP feature and the SIFT
feature. Due to the space limit, we only show the results using AP as the
metric for this experiment. The results are displayed in Figure 2. It is
observed that: 1) MCR using both UCF50 and SIN12 together with STIP+SIFT
features is better than MCR using UCF50+MED12 with the STIP feature for
all the events; 2) MCR using both UCF50 and SIN12 together with STIP+SIFT
features is generally better than MCR using SIN12+MED12 with the SIFT
feature, yet the former is weaker than the latter for three events, which
is presumably data-dependent; 3) SVM has a similar performance variation;
and 4) our method still yields better results than SVM when using one
feature type. This experiment validates that exploiting multiple attribute
video sets together with different features is beneficial in most cases.


6.  Conclusions 

We have proposed a method for utilizing attributes at the video level for
complex event detection. Video attributes are convenient to use for
complex event detection as many video collections relevant to people,
scenes, objects and actions are available. Meanwhile, video attributes
have more potential than image attributes to characterize the dynamic
properties of video data. Unlike the traditional approach which maps the
video data into an attribute space, our method learns a correlation vector
which correlates video attributes and a complex event. Built upon this,
the extra informative cues learnt from attribute videos are further
incorporated into the event detector. We have performed extensive
experiments using a real-world large-scale video dataset to evaluate the
efficacy of our method on complex event detection. The results are
encouraging and have verified the advantage of leveraging video attributes
properly.